Note: You have to run this notebook to view the visualizations. Alternatively, you may view the visualizations in Text-Clustering-with-Top2Vec.html.

Import the necessary libraries

Load dataset

Exploratory Data Analysis

To cluster the news, we first have to convert its text content into numerical representations, i.e. text embeddings. There are several approaches we can take to create the text embeddings (e.g. term frequency, TF-IDF, Word2Vec, GloVe, Doc2Vec, Universal Sentence Encoder, the BERT family of models, etc.).

For this exercise, we will use a pre-trained model from the BERT family to create the text embeddings for the news. This approach is taken because these models capture the semantic and contextual meaning of the text as well as word order. In particular, we will use the pre-trained model all-MiniLM-L6-v2 from the SentenceTransformers library to create the text embeddings for the news. The choice of model is somewhat arbitrary; any model from here should also deliver decent results so long as it was trained on English text.
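
Embedding creation can be sketched as below. The helper `embed_texts` is hypothetical (not from this notebook); the L2-normalization step is an assumed convenience so that dot products later equal cosine similarities. The commented-out lines show the typical SentenceTransformers usage, which requires a model download.

```python
import numpy as np

def embed_texts(model, texts):
    # Encode a list of strings into an (n_texts, dim) matrix.
    # SentenceTransformer.encode accepts a list of strings and
    # returns a numpy array of embeddings.
    emb = np.asarray(model.encode(texts), dtype=float)
    # L2-normalize each row so a dot product between two rows
    # equals their cosine similarity.
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.clip(norms, 1e-12, None)

# Typical usage (downloads the model on first run):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# embeddings = embed_texts(model, sentences)
```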

The BERT family of models, however, has a maximum sequence length of 512 tokens due to computational and memory constraints. Our chosen model was trained on data with a maximum sequence length of 128 tokens, so it performs best on inputs of at most 128 tokens.

Let's check the news dataset to see whether any of the news content violates the 128-token constraint. If so, we might need additional processing steps to abide by the constraint.
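
Token counts should be measured with the model's own subword tokenizer, not by splitting on whitespace. A minimal sketch follows; `count_tokens` is a hypothetical helper, and the commented lines assume the Hugging Face `transformers` library and a dataframe with a `content` column, neither of which is guaranteed to match this notebook's setup.

```python
def count_tokens(tokenizer, text):
    # Number of subword tokens the model will actually see,
    # excluding the special [CLS]/[SEP] tokens that the
    # tokenizer adds automatically around each input.
    return len(tokenizer.encode(text, add_special_tokens=False))

# Typical usage (downloads the tokenizer on first run):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(
#     "sentence-transformers/all-MiniLM-L6-v2")
# df["n_tokens"] = df["content"].apply(
#     lambda t: count_tokens(tokenizer, t))
```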

Number of Tokens

It can be observed that even the shortest news content, "Dementieva prevails in Hong Kong", has 200 tokens, which exceeds the maximum sequence length of 128. By default, if we pass an input with more than 128 tokens into the model, the model truncates the input sequence and makes an inference based only on the first 128 tokens. This results in information loss, especially for longer news content.

One way to resolve this is to split the news content into sentences and then obtain the text embeddings for each sentence. Clustering would then also be done at the sentence level. Sentences are generally shorter and hence less likely to violate the 128-token constraint.
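
Sentence splitting can be sketched with a naive regex, as below. This is an illustrative approximation only: it splits on sentence-ending punctuation followed by whitespace and will mishandle abbreviations like "U.S."; libraries such as nltk (`sent_tokenize`) or spaCy handle those cases more robustly.

```python
import re

def split_sentences(text):
    # Naive splitter: break after ., !, or ? when followed by
    # whitespace.  Good enough for a sketch; a proper sentence
    # tokenizer (nltk, spaCy) is preferable in practice.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```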

Let's check if any of the sentences has more than 128 tokens.

With the exception of one outlier (131 tokens), all the sentences have fewer than 128 tokens. Even for the outlier, the information loss due to truncation is minimal; only 5 tokens are truncated. (Although the maximum sequence length is 128 tokens, we have to allocate 2 token slots for the [CLS] and [SEP] tokens, leaving space for only 126 tokens; hence the number of truncated tokens = 131 - 126 = 5.)

Top 50 Terms in the News

The top 50 terms in the news are wide-ranging, which suggests that the news covers many different topics.

Text Embeddings

From the plot above, it can be observed that there are at least 5 dense clusters in the document embedding space. In some of the dense clusters, there are sentences from different news titles within the same cluster. This suggests the presence of common themes among some of the news content with different titles.

Clustering of Text Embeddings

There are several clustering algorithms (e.g. KMeans) that we can use to group the sentences. For this exercise, however, we will use a modified version of the Top2Vec algorithm to cluster the sentences; we use a different approach from the original Top2Vec for identifying topic words in the topic interpretation step.

The Top2Vec algorithm was chosen over other clustering algorithms because of the benefits it confers.

Benefits of Top2Vec

  1. Automatically finds the number of topics.

  2. No need for stop words removal.

  3. No need for stemming/lemmatization.

How Top2Vec works

  1. Create document embeddings using a pre-trained Sentence Transformer model.

    Documents with similar semantic meanings would be placed close together in embedding space.

  2. Apply UMAP (Uniform Manifold Approximation and Projection) to compress the document embeddings into a lower-dimensional space.

    Dimensionality reduction reduces the sparsity of the document embeddings, which helps in finding dense clusters.

  3. Find dense clusters of documents using HDBSCAN.

  4. For each dense cluster, calculate the centroid of the document embeddings in the original embedding space. This centroid is the topic vector.

  5. Merge duplicate topics using DBSCAN.
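
Step 4 above reduces to a per-cluster average. A minimal numpy sketch follows, assuming the cluster labels come from HDBSCAN run on the UMAP-reduced embeddings (HDBSCAN marks noise points with label -1); the function name `topic_vectors` is hypothetical.

```python
import numpy as np

def topic_vectors(embeddings, labels):
    # For each dense cluster (label >= 0), the topic vector is the
    # centroid of its member documents in the ORIGINAL embedding
    # space, not the reduced UMAP space.  Noise points (-1) are
    # excluded from every topic.
    labels = np.asarray(labels)
    topics = {}
    for k in sorted(set(labels.tolist())):
        if k == -1:
            continue  # skip HDBSCAN noise points
        topics[k] = embeddings[labels == k].mean(axis=0)
    return topics
```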

Topics Overview

The intertopic distance map below gives an overview of the topics identified in the news dataset.

Topic Sizes

12 clusters/topics were identified from the news dataset. Each cluster/topic is observed to contain more than one news title; this confirms our earlier conjecture that some of the news share common themes. It can also be observed that some news content spans different clusters/topics. For instance, the news "Huge rush for Jet Airways shares" and "£1.8m indecency fine for Viacom" span 3 clusters/topics (i.e. topics 2, 3 and 4) and 5 clusters/topics (i.e. topics 1, 3, 5, 7 and 8) respectively.

Topic Interpretation

We identify the top n words in each topic using c-TF-IDF scores. How this works is that for each topic, we join all the documents from that topic into a single document, so each topic has one joined document. We then calculate TF-IDF scores over these joined documents.

For each topic, we interpret the topic by looking at its top n words and the top n sentences most representative of the topic (computed based on cosine distance from the topic vector).

For this exercise, we will only analyze the top 3 largest topics (i.e. topic 0, 1 and 2).

Topic 0: Sports News

Topic 1: Political News

Topic 2: Economic News

Predicting the Topic of New News Content

We can predict the topic of new news content using the pipeline described below. The pipeline assumes that we already have a fitted topic model with computed topic vectors.

Pipeline

  1. Split the news content into sentences.

  2. Create text embeddings for each of the sentences using all-MiniLM-L6-v2 Sentence Transformer model.

  3. For each sentence, we calculate the cosine similarity between the sentence embedding and each of the topic vectors. We then assign the sentence to the topic it is most similar to (i.e. the topic whose topic vector has the highest cosine similarity with the sentence embedding), provided the cosine similarity score exceeds a predefined threshold. Otherwise, we treat that sentence as an outlier that does not belong to any of the topics. The predefined threshold acts as a safeguard to ensure that the sentence genuinely belongs to its assigned topic.

  4. We can deduce the topics covered in the news content based on the topics assigned to each of its sentences.
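
The assignment step (step 3 above) can be sketched in numpy as follows. The helper name `assign_topics` and the default threshold of 0.3 are assumptions for illustration, not values taken from this notebook.

```python
import numpy as np

def assign_topics(sentence_embs, topic_vecs, threshold=0.3):
    # L2-normalize rows so a dot product equals cosine similarity.
    S = np.asarray(sentence_embs, dtype=float)
    T = np.asarray(topic_vecs, dtype=float)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    sims = S @ T.T                      # (n_sentences, n_topics)
    best = sims.argmax(axis=1)
    # Sentences whose best similarity falls below the threshold are
    # treated as outliers and labelled -1.
    best[sims.max(axis=1) < threshold] = -1
    return best
```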

Predict Topic of New News using Fitted Topic Model

Let's run some predictions using our fitted topic model.

Expected Output:

  1. The first news is about animals, a topic not in our topic model; the model should classify it as an outlier with topic -1.

  2. The second news is about politics; the model should classify it to topic 1.

  3. The third news is about sports; the model should classify it to topic 0.